CACHE STRUCTURE FOR STORING
VARIABLE LENGTH DATA 

BACKGROUND

The present invention relates to a cache architecture for 
variable length data. When used in a processor core, the 
cache architecture can support storage of variable length 
instruction segments and can retrieve multiple instruction 
segments (or portions thereof) in a single clock cycle. The
cache architecture also contributes to minimized fragmentation
of the instruction segments.

FIG. 1 is a block diagram illustrating the process of 
program execution in a conventional processor. Program 
execution may include three stages: front end 110, execution
120 and memory 130. The front-end stage 110 performs
instruction pre-processing. Front-end processing 110 is
designed with the goal of supplying valid decoded instructions
to an execution unit 120 with low latency and high
bandwidth. Front-end processing 110 can include instruction
prediction, decoding and renaming. As the name implies, the 
execution stage 120 performs instruction execution. The 
execution stage 120 typically communicates with a memory 
130 to operate upon data stored therein. 

Conventionally, front end processing 110 may build 
instruction segments from stored program instructions to 
reduce the latency of instruction decoding and to increase 
front-end bandwidth. Instruction segments are sequences of 
dynamically executed instructions that are assembled into
logical units. The program instructions may have been
assembled into the instruction segment from non-contiguous
regions of an external memory space but, when they are
assembled in the instruction segment, the instructions appear
in program order. The instruction segment may include
instructions or uops (micro-instructions). 

A trace is perhaps the most common type of instruction 
segment. Typically, a trace may begin with an instruction of 
any type. Traces have a single-entry, multiple-exit architecture.
Instruction flow starts at the first instruction but may
exit the trace at multiple points, depending on predictions
made at branch instructions embedded within the trace. The
trace may end when one of a number of predetermined end
conditions occurs, such as a trace size limit, the occurrence
of a maximum number of conditional branches or the
occurrence of an indirect branch or a return instruction. 
Traces typically are indexed by the address of the first 
instruction therein. 

Other instruction segments are known. The inventors 
have proposed an instruction segment, which they call an
"extended block," that has a different architecture than the
trace. The extended block has a multiple-entry, single-exit
architecture. Instruction flow may start at any point within
an extended block but, when it enters the extended block,
instruction flow must progress to a terminal instruction in
the extended block. The extended block may terminate on a 
conditional branch, a return instruction or a size limit. The 
extended block may be indexed by the address of the last 
instruction therein.

A "basic block" is another example of an instruction
segment. It is perhaps the simplest type of instruction
segment available. The basic block may terminate on the
occurrence of any kind of branch instruction, including an
unconditional branch. The basic block may be characterized
by a single-entry, single-exit architecture. Typically, the
basic block is indexed by the address of the first instruction
therein.

Regardless of the type of instruction segment used in a
processor 110, the instruction segment typically is cached
for later use. Reduced latency is achieved when program
flow returns to the instruction segment because the instruction
segment may store instructions already assembled in
program order. The instructions in the cached instruction 
segment may be furnished to the execution stage 120 faster 
than they could be furnished from different locations in an 
ordinary instruction cache. 

Caches typically have a predetermined width; the width 
determines the maximum amount of data that could be 
retrieved from cache in a single clock cycle. The width of a 
segment cache typically determines the maximum size of the 
instruction segment. To retrieve data, a cache address is 
supplied to the cache, which causes contents of a cache entry 
to be driven to a cache output. 

Because instruction segments are terminated based on the 
content of the instructions from which they are built, the
instruction segments typically have variable length. So,
while a segment cache may have capacity to store, say, 16
instructions per segment, the average length of the instruction
segments may be much shorter than this maximum
length. In fact, in many typical applications, an average 
instruction segment length is slightly more than 8 instruc- 
tions per segment. If these instruction segments were stored 
in a traditional segment cache, the capacity of the segment 
cache may be under-utilized; the 8-instruction segment 
would prevent excess capacity in a much larger cache line 
from storing other data. Further, a traditional segment cache 
would output the smaller instruction segment, when 
addressed, even though it may have the capacity for much 
larger data items. 
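
The under-utilization described above can be illustrated with a short calculation. The following Python sketch is illustrative only: the function name `line_utilization` is hypothetical, and the figures used (16-instruction cache lines, 8-instruction segments) are the example values given in the text rather than measurements.

```python
def line_utilization(segment_length: int, line_capacity: int) -> float:
    """Fraction of a cache line's capacity occupied by one segment."""
    if not 0 < segment_length <= line_capacity:
        raise ValueError("segment must fit within one cache line")
    return segment_length / line_capacity


# Example values taken from the text: 16-instruction lines holding
# segments that average about 8 instructions.
utilization = line_utilization(segment_length=8, line_capacity=16)
unused_slots = 16 - 8
print(f"utilization = {utilization:.0%}, unused slots = {unused_slots}")
```

Under these assumed values, roughly half of each line in a traditional segment cache would hold no useful data.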

Accordingly, there exists a need in the art for a cache 
structure that stores variable length data and can output data 
with higher utilization than would be provided by a tradi- 
tional cache. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram illustrating the process of 
program execution in a conventional processor. 

FIG. 2 is a block diagram of a front end processing system 
according to an embodiment of the present invention. 

FIG. 3 is a block diagram of a segment cache according 
to an embodiment of the present invention.

FIG. 4 illustrates a relationship between exemplary segment
instructions and a cache bank according to the embodiments
of the present invention.

FIG. 5 illustrates exemplary operation of a cache accord- 
ing to an embodiment of the present invention. 

FIG. 6 illustrates exemplary operation of a cache accord- 
ing to an embodiment of the present invention. 

FIG. 7 illustrates exemplary operation of a reassembler 
according to an embodiment of the present invention.

FIG. 8 is a block diagram of a cache according to an 
embodiment of the present invention. 

FIG. 9 illustrates exemplary operation of a cache accord- 
ing to an embodiment of the present invention. 

DETAILED DESCRIPTION 

Embodiments of the present invention provide a cache 
architecture adapted to store data items of variable length. 
The cache may be populated by a number of independently 
addressable banks. If a data item occupies fewer than the 
total number of banks, unoccupied banks may be used to
store other data items. The cache architecture contributes to
higher utilization because data from multiple instruction 
segments may be read from a cache simultaneously. 

FIG. 2 is a block diagram of a front end processing system 
200 according to an embodiment of the present invention.
The front end 200 may include an instruction cache 210 and
an instruction segment engine ("ISE") 220. The instruction
cache 210 may be based on any number of known architectures
for front-end systems 200. Typically, they include
an instruction cache or memory 230, a branch prediction unit
("BPU") 240 and an instruction decoder 250. Program
instructions may be stored in the cache memory 230 and
indexed by an instruction pointer. Instructions may be
retrieved from the cache memory 230, decoded by the
instruction decoder 250 and passed to the execution unit (not
shown). The BPU 240 may assist in the selection of instructions
to be retrieved from the cache memory 230 for
execution. As is known, instructions may be indexed by an 
address, called an "instruction pointer" or "IP." 

According to an embodiment, an ISE 220 may include a
fill unit 260, a segment prediction unit ("SPU") 270 and a
segment cache 280. The fill unit 260 may build the instruction
segments. The segment cache 280 may store the instruction
segments. The SPU 270 may predict which instruction
segments, if any, are likely to be executed based on a current
state of program flow. It may cause the segment cache 280 
to furnish any predicted segment to the execution unit. The 
SPU 270 may generate prediction data for each of the 
instruction segments stored by the segment cache 280. 

The ISE 220 may receive decoded instructions from the 
instruction cache 210. The ISE 220 also may pass decoded 
instructions to the execution unit (not shown). A selector 290 
may select which front-end source, either the instruction 
cache 210 or the ISE 220, will supply instructions to the
execution unit. In an embodiment, the segment cache 280 
may control the selector 290. 

FIG. 3 illustrates a cache 300 according to an embodiment 
of the present invention. This structure may be appropriate 
for use as the segment cache 280 of FIG. 2. According to an
embodiment, the cache structure 300 may be populated by
a number of cache banks 310.1-310.N-1. The cache banks
310.1-310.N-1 each may include a plurality of cache lines
311, 312, 313, 314. The sets typically have uniform width
and may be tailored to store an integral number of instructions.
The cache lines 311 may maintain two fields. A first
field, called a tag field 322, may store a tag associated with
the data. The tag may be derived from the IP on which the
instruction segment stored in the cache line 311 is indexed.
The second field, called a data field 324, may store instruction
data from the instruction segment.

The cache 300 may accept separate address signals for 
each of the banks (addr0-addrN-1). In the example shown in
FIG. 3, address decoders 320.1-320.N-1 access the cache
lines based upon respective input addressing signals
330.1-330.N-1. Each bank 310.1-310.N-1 may be
addressed independently of the other banks. A cache line 
(say, 311) typically is addressed by a portion of an instruc- 
tion pointer, called a "set." 

Each cache bank 310.1-310.N-1 may include its own tag
comparator 340.1, 340.2, 340.3, . . . , 340.N-1. Each tag
comparator (say, 340.1) has two inputs. A first input is
provided in communication with the tag fields 322 of the
cache lines 311 in the respective bank 310.1. The tag
comparator 340.1 will receive tag data from one of the cache
lines that is addressed by the address decoder 330.1. A
second input receives a tag portion of an externally supplied
address. Thus, the tag comparator 340.1 may compare an
externally supplied tag with tag data stored in an addressed
cache line (say, 311). When the two tags agree, the tag
comparator 340.1 may generate an output identifying a tag
hit. Hit/miss outputs from the tag comparators
340.1-340.N-1 may be output to the selector 290 (FIG. 2).
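
The per-bank lookup described above may be sketched in software as follows. This is a minimal illustrative model, not a description of the hardware: the names `split_ip` and `bank_lookup`, the value of `SET_BITS`, and the choice of which IP bits form the set and which form the tag are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

SET_BITS = 6  # assumed: 64 cache lines per bank (not specified in the text)

# A cache line holds a tag field and a data field; None marks an empty line.
CacheLine = Optional[Tuple[int, List[str]]]


def split_ip(ip: int) -> Tuple[int, int]:
    """Split an instruction pointer into (set index, tag).

    Assumption: the low SET_BITS bits select the set and the
    remaining high bits form the tag.
    """
    set_index = ip & ((1 << SET_BITS) - 1)
    tag = ip >> SET_BITS
    return set_index, tag


def bank_lookup(bank: List[CacheLine], set_index: int, tag: int):
    """Address one line of a bank and compare its stored tag with `tag`.

    Returns (hit, data); data is None on a miss.
    """
    line = bank[set_index]
    if line is not None and line[0] == tag:
        return True, line[1]
    return False, None


# One bank with a single populated line.
bank: List[CacheLine] = [None] * (1 << SET_BITS)
set_index, tag = split_ip(0x1234)
bank[set_index] = (tag, ["uop_a", "uop_b"])

print(bank_lookup(bank, set_index, tag))      # hit: tags agree
print(bank_lookup(bank, set_index, tag + 1))  # miss: tags disagree
```

In a cache with several banks, one such lookup would be performed per bank, each with its own set index, which models the independent addressing of the banks.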

For each clock cycle, the cache 300 may output data 
having a width that is determined by the cumulative width 
of the cache lines of all the banks 310.1-310.N-1. As noted,
however, different cache lines in each bank may be 
addressed independently of the other. If two or more instruc- 
tion segments are stored in non-overlapping banks, it is 
possible to retrieve them from the cache 300 during a single 
clock cycle. Even when instruction segments partially over- 
lap banks, it is possible to retrieve data in excess of one 
instruction segment. 
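
The effect of independent bank addressing on a single read cycle may be sketched as follows. The sketch is illustrative only: the function name `banks_read_this_cycle`, the representation of bank occupancy as Python sets, the rule that the first segment has priority, and the bank assignments in the examples are assumptions for the purpose of illustration.

```python
from typing import Dict, Set


def banks_read_this_cycle(first: Set[int], second: Set[int]) -> Dict[str, Set[int]]:
    """Decide which banks serve each of two segments in one clock cycle.

    Assumption: the first segment has priority and is read from all of
    its banks; the second segment is read only from banks that the
    first segment does not use.
    """
    return {
        "first": set(first),
        "second": second - first,
        "deferred": second & first,  # must wait for a later cycle
    }


N = 4  # assumed bank count for the example

# Non-overlapping banks: both segments can be read in full.
print(banks_read_this_cycle({0, N - 1}, {1, 2}))

# Partial overlap at bank N-1: part of the second segment is deferred,
# yet more than one segment's worth of data is still read.
print(banks_read_this_cycle({0, N - 1}, {2, N - 1}))
```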

FIG. 4 is a functional diagram illustrating an addressing
system 400 according to an embodiment of the present
invention. The addressing system 400 may determine how
the various banks in the cache 300 (FIG. 3) will be
addressed. As shown, the addressing system 400 may
include a segment predictor 410, a transaction queue 420, a
priority encoder 430, the cache 440, a reassembler 450 and 
a cache directory 460. On each clock cycle, based on a 
current state of program flow, the segment predictor 410 
may predict one or more instruction segments that should be 
retrieved from the segment cache 280 (FIG. 2). In the 
example illustrated in FIG. 4, the segment predictor 410 is 
shown predicting the next two instruction segments; other 
implementations are possible. As its name implies, the 
transaction queue 420 may queue results from the segment 
predictor 410 until they are used. 

The priority encoder 430 retrieves the queued prediction 
results and addresses the cache 440 based on bank usage.
FIG. 4 illustrates four separate address lines 435 interconnecting
the priority encoder 430 and the cache 440 to
represent the address inputs for each bank in the cache.
There may be a separate set of address lines for each bank
(FIG. 3, 310.1-310.N-1) in the cache 440. Data output from
the cache 440 may be reassembled by the reassembler 450. 
An output from the reassembler 450 may be output to the 
execution stage (FIG. 1, 120). 

According to an embodiment, prediction results from the 
segment predictor 410 may include an IP of the instruction 
segment, a bank vector and a length vector. An instruction 
segment's IP may determine the set and tag data to be 
applied to the cache 440. The bank vector may identify 
which of the cache banks (310.1-310.N-1, FIG. 3) are to be
addressed with the set and tag data. The length vector may 
indicate a length of data to be read from the cache. The cache 
directory 460 also may store data associated with each 
instruction segment, including an order vector. When an 
instruction segment is stored across multiple banks in the 
cache 440, the order vector may identify which bank stores 
the beginning of the instruction segment, which bank stores 
a second portion of the instruction segment, and so on. The 
order vector is useful for re-ordering the output of the cache 
to return the contents of each bank to its position in program 
order. 
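
The roles of the bank vector, the length vector and the order vector may be sketched as follows. This is an illustrative model under stated assumptions: the bank vector is represented as a bitmask, the order vector as a list of bank indices in program order, and the length vector as a single count of valid instructions. The names `banks_in_vector` and `reassemble`, the bank count and the sample data are hypothetical, and the sketch assumes that valid data is packed at the start of each bank's output with only the final bank partly filled.

```python
from typing import Dict, List

N_BANKS = 4  # assumed number of banks, for illustration only


def banks_in_vector(bank_vector: int) -> List[int]:
    """List the bank indices whose bit is set in a bank-vector bitmask."""
    return [b for b in range(N_BANKS) if bank_vector & (1 << b)]


def reassemble(bank_output: Dict[int, List[str]],
               order_vector: List[int],
               length: int) -> List[str]:
    """Return a segment's instructions in program order.

    order_vector lists bank indices from the beginning of the segment
    to its end; length is the number of valid instructions to keep.
    """
    stream: List[str] = []
    for bank in order_vector:
        stream.extend(bank_output[bank])
    return stream[:length]


# Hypothetical segment stored in banks 0 and 3, beginning in bank 3.
bank_vector = 0b1001
order_vector = [3, 0]
bank_output = {
    0: ["i4", "i5", "pad", "pad"],   # second portion, partly filled
    3: ["i0", "i1", "i2", "i3"],     # beginning of the segment
}

# The order vector covers exactly the banks named by the bank vector.
assert sorted(order_vector) == banks_in_vector(bank_vector)
print(reassemble(bank_output, order_vector, length=6))
```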

According to an embodiment, the transaction queue 420
may decouple timing relationships between the segment
predictor 410 and the priority encoder 430. As shown in FIG.
4, the segment predictor 410 may predict a predetermined
number of instruction segments to retrieve on each clock
cycle. The example in FIG. 4 shows prediction of two
instruction segments per clock cycle. As discussed below,
however, the segment predictor 430 may predict a variable
number of instruction segments from the transaction queue
420 in any given clock cycle. Buffering provided by the
transaction queue 420 helps to decouple these timing
relationships.

As noted above, the cache 440 may retrieve valid data
from each bank (310.1-310.N-1, FIG. 3) per clock cycle.
Because the length of instruction segments may vary, it is
likely that some instruction segments will occupy less than
the maximum number of banks that are available. A bank
that does not store valid data for the highest priority
instruction segment, the instruction segment at the top of the
transaction queue, is free to retrieve data for another
instruction segment. According to an embodiment, the priority
encoder 430 may compare the bank vectors of two or more
instruction segments to determine which banks to address.

FIG. 5 provides an example of two such instruction
segments, IS1 and IS2, that could be stored in the cache 300
and their associated bank vectors. In this example, it may be
assumed that IS1 occurs before IS2. FIG. 5 presents an
example where there is no overlap between banks that store
instruction segment IS1 and the banks that store instruction
segment IS2. Banks 0 and N-1 are valid for instruction
segment IS1 and banks 1 and 2 are valid for instruction
segment IS2. Because there is no overlap between the bank
vectors, the priority encoder 430 may retrieve data from all
four banks simultaneously. The two instruction segments
may be retrieved in their entirety in one clock cycle. Thus,
FIG. 5 illustrates the output of the cache when retrieving
both instruction segments IS1 and IS2 from the cache. In an
embodiment, the priority encoder 430 may compare the
bank vectors of the two instruction segments IS1 and IS2 to
determine which cache lines to address for each bank.

A more complex situation is presented in FIG. 6. It occurs
when there is partial overlap between the bank vectors. In
this case, instruction segment IS1 is distributed among
banks 0 and N-1 as in FIG. 5 but instruction segment IS2 is
distributed among banks 2 and N-1. In this case, a bank
vector comparison would indicate a conflict at bank N-1;
the two instruction segments cannot be retrieved from the
cache in their entirety in a single clock cycle. In this case, the
priority encoder 430 (FIG. 4) may address the cache 440 to
retrieve data for instruction segment IS1 in its entirety; it is
first in order of program flow. The priority encoder 430 also
may cause non-overlapping portions of the second instruction
segment IS2 to be retrieved from the cache as well.
Thus, set 15 may be addressed in bank 2. Only a portion of
the second instruction segment IS2 will be retrieved from
the cache; the other portion must be deferred another clock
cycle. FIG. 6 illustrates the cache output when addressed in
this manner.

Although a second instruction segment cannot be
retrieved in its entirety when a bank conflict occurs, retrieval
of non-overlapping portions of an instruction segment can
be useful if the non-overlapping portions are continuous with
the preceding instruction segment, measured in terms of
program flow. In the example of FIG. 6, if the instructions
in bank 2 represent the beginning of the second instruction
segment IS2, the instructions therein would be continuous
with the end of the first instruction segment IS1. In this case,
all the instructions read from the cache 440 could be
forwarded directly to the execution unit for processing.
There would be no need to wait for the remainder of the
second instruction segment IS2 to be read from the cache
440.

If, however, the contents of bank 2 represent the end of the
second instruction segment IS2, the instructions therein
would not be continuous with those from the first instruction
segment IS1. The instructions from the end of instruction
segment IS2 could not be executed until after the instructions
in bank N-1, those of the beginning of instruction
segment IS2, are executed. The contents of bank 2 cannot be
forwarded to the execution unit in this case.

According to an embodiment of the present invention, the
priority encoder 430 (FIG. 4) may address the cache 440
speculatively to cause all non-conflicting banks to be read.
The data read from the cache 440 may be input to the
reassembler 450 along with the order vector from the cache
directory 460 identifying bank order for each of the instruction
segments IS1, IS2. The reassembler 450 may cause any
data from instruction segments that cannot be reassembled
into a continuous instruction stream to be filtered from the
output of the addressing system 400. Thus, if a portion of the
second instruction segment IS2 cannot be integrated with the
instructions from the first instruction segment because, for
example, a bank conflict prevents another portion of the
second segment from being read, the reassembler
450 can cause the data to be eliminated from its output.
Data for the second instruction segment IS2 would remain in
the cache 440 and could be retrieved in a subsequent clock
cycle when the conflict with instruction segment IS1 would
be cleared. This embodiment is advantageous because it
contributes to increased bandwidth: data can be read from
the cache 440 while the cache directory 460 decodes the
prediction data associated with the instruction segments.

Alternatively, instead of eliminating the discontinuous
portion of the second instruction segment IS2, the reassembler
450 itself could include a recording mechanism such as a
buffer (not shown) to preserve the data. The data may be
preserved until the next clock cycle when the remaining
portion of the second instruction segment IS2 could be read
from the cache 440. In this alternative, preserving the
discontinuous data can increase throughput from the cache
440 because the discontinuous data need not be re-read from
the cache; the bank that stores the discontinuous data
possibly could be used to retrieve still other data.

Returning to FIG. 4, the output of the cache 440 may be
input to a reassembler 450. As noted, the cache 440 need not
output instructions in program order. The reassembler 450
may shift the output data as necessary to assemble a
continuous stream of instructions in program order from the
cache output. In the embodiment shown in FIG. 5, there can
be two issues: First, the output of the banks may need to be
re-ordered to preserve instruction order. Second, the banks
themselves may not be fully occupied with valid data for an
instruction segment (see, for example, set 23, bank 0). An
instruction segment's order vector may identify the position
of each bank within the instruction segment. A length vector
may identify a length of the instruction segment. According
to an embodiment, the reassembler 450 may shift the output
of the cache to return the contents of the banks to program
order. This function is illustrated in FIG. 7 using the
exemplary data output from the cache shown in FIG. 6.

According to an embodiment, the reassembler 450 may be
populated by a plurality of multiplexers (not shown)
provided in a layered arrangement. A first layer of multiplexers
710 may re-order the presentation of blocks according to the
order vector provided by the cache directory 460. A second
layer of multiplexers 720 may collapse the instructions
within the blocks according to the length vector. Once the
output of the reassembler 450, a continuous stream of
instructions, is assembled, the instructions may be output to
the execution stage 120 (FIG. 1) for processing.

FIG. 8 illustrates a cache structure 800 according to
another embodiment of the present invention. The cache 800
may be populated by a plurality of cache banks
810.1-810.N-1, each of which may be addressed independently
of the other. In this embodiment, each bank in the
cache 800 may be a set associative cache; each bank (e.g.
810.1) may be populated by a plurality of cache entries
organized into multiple ways. For simplicity, the example of
FIG. 8 illustrates only two ways 820.1, 830.1 for each bank
810.1; there could be more. Each way 820.1, 830.1 may be
populated by a plurality of cache entries (labeled 821, 822,
823, etc. for way 820.1). The cache entries each may include
a first field T to store a tag identifier and a second field D to
store data to be retrieved from the cache.

In an embodiment, a bank 810.1 may include a plurality
of comparators 840.1, 850.1, one provided for each way
820.1, 830.1 of the bank 810.1. One input of each comparator
may be coupled to the output of the tag field T of the
respective way 820.1, 830.1. Thus, comparator 840.1 is
shown coupled to the tag field T of the first way 820.1 in
bank 0 810.1. A second input of each comparator 840.1,
850.1 may be coupled to a common tag input for the bank.
Thus, when tag data is retrieved from the ways 820.1, 830.1
of a bank 810.1, the tag data may be compared with an
externally supplied tag address. A comparator 840.1, 850.1
may generate a HIT signal if the data on its inputs match
each other. Because all tags in the same set in the same bank
of a set associative cache must differ, only one of the
comparators 840.1, 850.1 will generate a match.

Each bank 810.1 of the cache 800 may include a selection
multiplexer 870.1 coupled to the data portions of the two
ways 820.1, 830.1 according to an embodiment. The selection
multiplexer 870.1 may be controlled by the output of the
tag comparators 840.1, 850.1. Thus, the selection multiplexer
870.1 may propagate data from one of the ways,
depending upon which tag comparator 840.1, 850.1, if any,
indicates a match.

Each bank 810.1-810.N-1 may include an address
decoder 880.1-880.N-1. In response to an applied address
signal on its input, an address decoder (e.g. 880.1) may
access a cache entry in each way and cause the contents
stored in the respective entry to be retrieved therefrom.
According to an embodiment, the data supplied to the tag
input for each bank may be derived from the IP of the
instruction segment. Thus, although two instruction segments
may have sufficient commonality between their IPs to
be stored in the same set within the cache, their IPs may be
sufficiently different to have different tags. Thus, the cache
structure 800 of FIG. 8 provides increased capacity over the
embodiment of FIG. 3.

A cache 800 having multiple ways 820.1, 830.1 is called
an "associative cache." Associativity multiplies capacity of
a cache linearly with the number of ways in the cache. It also
contributes to reduced thrashing of data. Consider an
example where two instruction segments having a length of
10 instructions must be stored in a non-associative cache
having four banks, each four instructions wide (see FIG. 3).
If the IPs of the two instruction segments have matching
sets, the two instruction segments could not be stored
simultaneously in the cache. Writing the second instruction
segment into the cache would require over-writing data of
the first instruction segment. By contrast, an associative
cache can accommodate the two instruction segments; they
could be stored in the same set but in different ways. The
associative cache 800 reduces thrashing of data stored in the
cache.

Although the embodiments described herein find application
for all classes of instruction segments, they may be
optimized for use in a system based upon extended blocks.
For extended blocks, prediction results may include an IP of
a predicted extended block, a bank vector identifying banks
in a cache 440 (FIG. 4) that store valid data for the extended
block, an offset vector identifying a length of data to be
retrieved from the extended block and, from the cache
directory, an order vector specifying bank order.

Consider the example shown in FIG. 9. In this example,
a first extended block XB1 may be distributed over two
banks of a cache; a second extended block XB2 may be
distributed over three banks. Assume that the first extended
block XB1 necessarily flows to extended block XB2 as might
occur, for example, from a return instruction. A conflict
would occur at bank 0 if the full length of the extended block
XB1 were required. However, because an extended block
possesses a multiple-entry, single-exit architecture, the full
length of an extended block may not be required for each
prediction. In an extended block, program flow may enter an
extended block at any instruction therein but, once it does,
program flow necessarily flows to the terminal instruction
therein. Thus, a "referring instruction" from another
instruction may determine at what point program flow will
enter the extended block. Thus, in an example, a segment
predictor may record different bank vectors for the same
extended block based on a "referring instruction," the
instruction that caused program flow to enter the extended
block.

The example of FIG. 9 is continued in Table 1 below.
Table 1 illustrates stored data that might be found in a
segment predictor 410 (FIG. 4). The first two rows identify
referring instructions that point to XB1. For the first referring
instruction, a branch instruction, the segment predictor
stores the IP of XB1, a bank vector identifying two banks as
storing valid data and an offset identifying a length of data
to be retrieved from the two banks. For the second referring
instruction, another branch instruction, the segment predictor
410 stores the same IP (the IP of XB1) but the bank
vector identifies a single bank as storing valid data and
another offset value. The third referring instruction is the
terminal instruction in XB1; it stores the IP of XB2 and
respective bank and the offset vector.

TABLE 1

Thus, bank conflict between two extended blocks may
depend upon the length of data to be retrieved from each. In
the example above, there is no bank conflict between XB1
and XB2 when XB1 is entered from the second branch
instruction. FIG. 9 illustrates the output of the cache
(FIG. 4) in this case. However, a bank conflict does occur
when XB1 is entered from the first branch instruction.

Several embodiments of the present invention are
specifically illustrated and described herein. However, it will be
appreciated that modifications and variations of the present
invention are covered by the above teachings and within the
purview of the appended claims without departing from the
spirit and intended scope of the invention.

We claim:

1. A method of retrieving variable length data items stored
in a cache having a predetermined number N of banks,
comprising:

for a first data item to be read, determining a first number
from 1 to N of the banks said first number indicating a
number of the banks in which the first data item is
stored,

identifying and addressing the banks that store the first 
data item, 

for a second data item to be read, determining a second 
number from 1 to N of the banks said second number 
indicating a number of the banks in which the second 
data item is stored, 

addressing cache entries of any bank storing the second 
data item that does not overlap with the banks storing 
the first data item, and 

simultaneously reading data from all the addressed cache 
entries. 

2. The method of claim 1, wherein respective bank vectors 
identify, for each data item, the banks in which the data item 
is stored. 

3. The method of claim 1, further comprising reorganizing 
the data read from the cache according to an order vector. 

4. The method of claim 1, further comprising storing a
portion of the second data unit that is discontinuous from the 
first data unit for a subsequent iteration of the method. 

5. The method of claim 4, further comprising, on a 
subsequent iteration: 

reading portions of the second data unit from the over- 
lapping banks, and 

for a third data item to be read, determining a number of 
banks that store portions of the third data item, 

identifying banks storing the portions of the third data 
item that are non-overlapping with the banks being read 
during the subsequent iteration, 

reading the non-overlapping banks storing portions of the 
third data item and, 

outputting the stored portions of the second data unit, the 
portions of the second data unit from the overlapping 
banks, and the read portions of the third data item. 

6. A method of retrieving variable length instruction 
segments stored in multiple banks of a cache, comprising: 

for a first instruction segment to be read, identifying the 
banks in which portions of the first instruction segment 
is stored, 

addressing cache entries within the identified banks that
store the first instruction segment,

for a second instruction segment to be read, identifying
banks in which portions of the second instruction
segment is stored,

addressing cache entries of any bank storing the second
instruction segment that does not overlap with the
banks storing the first instruction segment, and

simultaneously reading data from all the addressed cache
entries.

7. The method of claim 6, wherein respective bank vectors
identify, for each data item, the banks in which the data item 
is stored. 

8. The method of claim 7, further comprising comparing 
the bank vectors of the first and second data item to identify 
the non-overlapping banks. 

9. The method of claim 6, further comprising reorganizing 
the data read from the cache. 

10. The method of claim 6, wherein the instruction 
segments are traces. 

11. The method of claim 6, wherein the instruction 
segments are extended blocks. 

12. The method of claim 6, further comprising: 
outputting the first instruction segment and any non- 
blocked portion of the second instruction segment that 
are adjacent to the first instruction segment in program 
flow, and 

storing for a subsequent iteration of the method, any 
non-blocked portion of the second instruction segment 
that is not adjacent to the first instruction segment in 
program flow. 

13. A cache system, comprising: 

a plurality of cache banks, each having entries of a 
predetermined width, the cache banks to store instruc- 
tion segments, an instruction segment being stored in 
from one to all of the banks, and 
an address generator: 

responsive to an instruction pointer of a desired first 
instruction segment, to identify which of the banks 
store portions of the first instruction segment and to 
address the identified banks, wherein from one to all 
of the banks are identified, and
responsive to an instruction pointer of a desired second 
instruction segment, to identify which of the banks 
store portions of the second instruction segment, to 
determine if any of the banks storing portions of the 
second instruction segment are non-overlapping with 
the banks storing portions of the first instruction 
pointer and, if so, to address the non-overlapping 
banks. 

14. The cache system of claim 13, wherein the cumulative 
width of all of the banks equals a maximum permissible 
length of an instruction segment. 

15. The cache system of claim 13, further comprising 
multiple layers of multiplexers to reorder an output of the 
addressed banks according to a program order. 

16. The cache system of claim 13, wherein each of the 
cache banks has a set associative structure including mul- 
tiple ways. 

17. The cache system of claim 13, wherein the instruction 
segments are traces. 

18. The cache system of claim 13, wherein the instruction
segments are extended blocks.